graph TD
A[House A<br/>Need to predict] -->|needs price of| B[House B<br/>Need to predict]
B -->|needs price of| C[House C<br/>Need to predict]
C -->|needs price of| A
style A fill:#e74c3c,stroke:#c0392b,color:#fff
style B fill:#e74c3c,stroke:#c0392b,color:#fff
style C fill:#e74c3c,stroke:#c0392b,color:#fff
WHY SPATIAL LAG MODELS DON’T WORK FOR PREDICTING
DON’T USE AI WITHOUT CRITICAL THINKING!
Quick Clarification
Some midterm teams suggested:
“To improve predictions, implement a spatial lag model (SAR) or spatial error model (SEM)”
Let’s revisit why this doesn’t work for prediction
(And what you should recommend instead)
First: What IS a Spatial Lag Model?
Standard regression: \[\text{Price}_i = \beta_0 + \beta_1(\text{Sqft}) + \beta_2(\text{Beds}) + \varepsilon\]
Spatial lag regression: \[\text{Price}_i = \beta_0 + \rho \times \color{red}{\text{Avg(Neighbor Prices)}} + \beta_1(\text{Sqft}) + \beta_2(\text{Beds}) + \varepsilon\]
Key difference: Your price depends on your neighbors’ actual prices
Question: “Do nearby house prices affect each other?” (spillover effects)
Used for: Understanding spatial processes, causal inference about neighborhood effects
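To make the lag term concrete, here is a minimal base-R sketch. The four houses, their prices, and the neighbor matrix `nb` are made-up assumptions for illustration, not course data. It shows that "Avg(Neighbor Prices)" is a row-standardized weights matrix multiplied by the price vector.

```r
# Four hypothetical houses with known sale prices (illustrative numbers only)
price <- c(A = 300000, B = 320000, C = 280000, D = 350000)

# Binary neighbor matrix: row i has a 1 for each neighbor of house i
nb <- matrix(c(0, 1, 1, 0,
               1, 0, 1, 1,
               1, 1, 0, 1,
               0, 1, 1, 0),
             nrow = 4, byrow = TRUE,
             dimnames = list(names(price), names(price)))

# Row-standardize so each row sums to 1 (every neighbor gets equal weight)
W <- nb / rowSums(nb)

# The spatial lag is the weighted average of the neighbors' prices
lag_price <- as.vector(W %*% price)
names(lag_price) <- names(price)
print(lag_price)
```

Note that every entry of `lag_price` is built from other houses' observed prices, which is exactly the input that will be unavailable for unsold listings.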
The Problem for Prediction
Let’s work through two concrete scenarios where this breaks down:
- Temporal: Training on 2024, predicting 2025
- Transfer: Philadelphia model → Orlando
Scenario 1: The Temporal Problem
Training on 2024 data, predicting 2025 sales
Step 1: Estimate Model on 2024 Data
You fit this spatial lag model on 2024 Philadelphia sales:
# Your estimated model from 2024
model_2024 <- spatialreg::lagsarlm(
  log(price) ~ sqft + bedrooms + bathrooms,
  data = sales_2024,
  listw = neighbors_weights
)
Results:
Spatial lag coefficient (ρ) = 0.65
β_sqft = 0.00015
β_beds = 0.12
Interpretation: Because the model is in logs, a 1% increase in neighbors’ average price → roughly a 0.65% increase in my price
This works for 2024 because all prices are known!
Step 2: Three Houses List in January 2025
Your prediction task:
| House | Sqft | Beds | Baths | 5 Nearest Neighbors |
|---|---|---|---|---|
| A | 1,500 | 3 | 2 | B, C, D, E, F |
| B | 1,800 | 3 | 2 | A, C, G, H, I |
| C | 2,000 | 4 | 3 | A, B, J, K, L |
What you know:
- ✓ Sqft, beds, baths for A, B, C
- ✓ Locations of A, B, C
- ✗ What A, B, C will actually sell for
Step 3: Try to Predict House A
Your model equation: \[\text{Price}_A = \beta_0 + 0.65 \times \color{red}{\text{Avg}(\text{Price}_B, \text{Price}_C, ...)} + 0.00015 \times 1500 + 0.12 \times 3\]
Problem: You need Price_B and Price_C to predict Price_A
But wait… Price_B and Price_C haven’t happened yet! They’re listed, not sold.
Step 4: Realize the Circular Dependency
Try to predict House B: \[\text{Price}_B = \beta_0 + 0.65 \times \color{red}{\text{Avg}(\text{Price}_A, \text{Price}_C, ...)} + ...\]
Try to predict House C: \[\text{Price}_C = \beta_0 + 0.65 \times \color{red}{\text{Avg}(\text{Price}_A, \text{Price}_B, ...)} + ...\]
Price_A needs Price_B and Price_C
Price_B needs Price_A and Price_C
Price_C needs Price_A and Price_B
You’re stuck in a circular dependency!
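The dead end can be reproduced in a few lines of base R. This is a sketch under a simplifying assumption: A, B, and C are treated as each other's only neighbors (the slide's other neighbors D through L are left out), and the weights matrix `W` is made up for illustration.

```r
# Three new listings: attributes are known, sale prices are not
listings <- data.frame(
  house = c("A", "B", "C"),
  sqft  = c(1500, 1800, 2000),
  beds  = c(3, 3, 4),
  price = NA_real_
)

# Simplification: A, B, C are each other's only neighbors (equal weights)
W <- matrix(c(0,   0.5, 0.5,
              0.5, 0,   0.5,
              0.5, 0.5, 0),
            nrow = 3, byrow = TRUE)

# The lag term the model needs as an input is built from unknown prices
listings$avg_neighbor_price <- as.vector(W %*% listings$price)
print(listings)
```

Every value of `avg_neighbor_price` comes back missing, so the model has no input to plug into the ρ term for any of the three houses.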
Visual: The Circular Dependency
All the unknowns depend on each other!
“But Can’t I Use Recent Sales?”
You might think: “I’ll use recent sales from December 2024 as the spatial lag”
- Problem 1: House A’s neighbors (B, C) haven’t sold - they’re ALSO new listings
- Problem 2: If you use OLD sales (from months ago), you’re predicting based on stale prices in a changing market
- Problem 3: What if it’s a new development? No recent sales exist nearby
- Problem 4: Your spatial lag coefficient (ρ = 0.65) was estimated assuming SIMULTANEOUS prices, not lagged prices
Bottom line: Spatial lag models assume all observations exist simultaneously. Prediction is inherently sequential.
Scenario 2: The Transfer Problem
Selling your Philadelphia model to Orlando
Your Sales Pitch to Orlando
You: “We built an amazing spatial lag model for Philadelphia! R² = 0.85!”
Orlando Chief Data Officer: “Great! We have 5,000 active listings. Can you predict their prices?”
You: “Sure! Just send me the data…”
(You open the file)
library(readr)  # provides read_csv()
orlando_listings <- read_csv("orlando_new_listings.csv")
# Variables: address, sqft, beds, baths, lat, lon
# Missing: SALE_PRICE (that's what we're predicting!)
You realize: “Wait… I need neighbors’ prices to predict each price…”
Orlando: “That’s why we hired you - these HAVEN’T sold yet!”
The Parameter Transfer Problem
Even if you had some recent Orlando sales to use:
Your Philadelphia model: \[\text{Price}_i = \beta_0 + \color{red}{0.65} \times \text{Avg}(\text{Neighbor Prices}) + \beta_1(\text{Sqft}) + ...\]
Questions:
- ρ = 0.65 was estimated on Philadelphia’s price distribution (avg $350k)
- Orlando’s average is $280k - different scale
- Orlando is sprawling suburbs vs. Philadelphia’s dense rowhouses
- Does ρ = 0.65 even mean the same thing in Orlando?
Answer: No! You’d have to re-estimate the entire model on Orlando data.
So your “model” isn’t actually transferable.
Contrast: Spatial Features Transfer
If you had built with spatial FEATURES:
# Philadelphia model
model <- lm(log(price) ~ sqft + bedrooms +
              dist_to_transit + parks_500ft +
              median_income_tract,
            data = philly)
To use in Orlando:
# Get Orlando's spatial context (all observable!)
# (get_transit_distance, count_parks_buffer, get_census_data are placeholder helpers)
orlando$dist_to_transit <- get_transit_distance(orlando)
orlando$parks_500ft <- count_parks_buffer(orlando, 500)
orlando$median_income_tract <- get_census_data(orlando)
# Apply Philadelphia coefficients
# The model predicts log(price), so back-transform to dollars
orlando$predicted_price <- exp(predict(model, newdata = orlando))
This works because features are CONTEXT, not outcomes!
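As one illustration of an observable spatial feature, the sketch below computes distance to the nearest transit stop from coordinates alone. The function name `haversine_miles`, the listing coordinates, and the stop coordinates are all hypothetical; a real workflow would more likely use a spatial package and projected coordinates.

```r
# Great-circle distance in miles between lat/lon points (haversine formula)
haversine_miles <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  earth_radius_miles <- 3958.8
  2 * earth_radius_miles * asin(pmin(1, sqrt(a)))
}

# Hypothetical unsold listings and transit stops (made-up coordinates)
listings <- data.frame(id  = c("A", "B"),
                       lat = c(28.540, 28.600),
                       lon = c(-81.380, -81.200))
stops <- data.frame(lat = c(28.545, 28.610),
                    lon = c(-81.379, -81.210))

# For each listing, take the distance to the closest stop
listings$dist_to_transit <- sapply(seq_len(nrow(listings)), function(i) {
  min(haversine_miles(listings$lat[i], listings$lon[i], stops$lat, stops$lon))
})
print(listings)
```

Nothing here depends on any sale price, so the feature can be computed for a listing the day it appears.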
Visual Comparison
Spatial Lag (Fails)
graph TD
A[Predict<br/>House A] -->|needs| B[Price of B<br/>UNKNOWN]
A -->|needs| C[Price of C<br/>UNKNOWN]
style A fill:#e74c3c
style B fill:#e74c3c
style C fill:#e74c3c
Circular dependency
Spatial Features (Works)
graph TD
T[Transit: 0.3mi<br/>KNOWN] --> A[Predict<br/>House A]
P[Parks: 2<br/>KNOWN] --> A
I[Income: 65k<br/>KNOWN] --> A
style A fill:#27ae60
style T fill:#3498db
style P fill:#3498db
style I fill:#3498db
All inputs observable
So When ARE Spatial Lag Models Useful?
Spatial lag models are GREAT for:
- Understanding spillover effects: “Does gentrification in one neighborhood cause price increases in adjacent neighborhoods?”
- Causal inference: “Do nearby foreclosures depress my home value?”
- Policy simulation: “If we build a park here, how will it affect the surrounding area?”
- Cross-sectional analysis: Looking at ONE point in time where all prices exist
But NOT for:
- Predicting future sales
- Transferring models between cities
- Real-time valuation systems
- Out-of-sample forecasting
What High Moran’s I Actually Tells You
If your model errors have Moran’s I = 0.58 (high spatial clustering):
❌ Wrong response: “Switch to spatial lag model”
✓ Right response: “Add better spatial features!”
Instead of: “Implement a spatial lag model”
Say this:
“The high Moran’s I (0.58) indicates spatial clustering in errors, suggesting our model is missing important location-based predictors. Recommendations:
- Vary buffer distances - Currently using 500ft; try 250ft, 1000ft, 1500ft
- Add more amenities - Coffee shops, grocery stores, restaurants, crime incidents
- Richer census data - Use block group instead of tract; add commute time variables
- Neighborhood interactions - sqft × neighborhood, age × distance_downtown
- Time-varying features - Recent building permits, development activity
- More granular fixed effects - Census block group FE instead of neighborhood FE”
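For reference, Moran's I on model errors can be sketched in base R as below. The function `morans_i`, the six-location line layout, and both error vectors are illustrative assumptions. In practice a tested implementation such as `spdep::moran.test` is a safer choice, and re-running the statistic after adding features is one way to check whether the clustering has shrunk.

```r
# Moran's I for a numeric vector x and a spatial weights matrix W
morans_i <- function(x, W) {
  z  <- x - mean(x)          # center the values
  n  <- length(x)
  s0 <- sum(W)               # sum of all weights
  (n / s0) * sum(W * outer(z, z)) / sum(z^2)
}

# Hypothetical layout: 6 locations on a line, neighbors are adjacent positions
n <- 6
W <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  W[i, i + 1] <- 1
  W[i + 1, i] <- 1
}

clustered   <- c(-2, -2, -1, 1, 2, 2)   # similar errors sit next to each other
alternating <- c(-1, 1, -1, 1, -1, 1)   # opposite errors sit next to each other

morans_i(clustered, W)     # positive: errors cluster in space
morans_i(alternating, W)   # negative: neighboring errors have opposite signs
```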
The Key Distinction
SPATIAL FEATURES (What you observe about a location)
- Distance to transit
- Parks within buffer
- Median income
- Crime rate
- School quality
- Walkability score
For prediction: ✓ Always observable
SPATIAL LAG (What neighbors’ outcomes are)
- Average neighbor price
- Neighbor sale date
- Neighbor appreciation rate
For prediction: ✗ Creates circular dependency
Fix spatial autocorrelation by improving FEATURES, not by changing model structure
Machine Learning Doesn’t Fix This
Some might think: “I’ll use a Random Forest with neighbors’ prices as a feature!”
# This STILL doesn't work for prediction
library(randomForest)
rf_model <- randomForest(
  price ~ sqft + bedrooms + avg_neighbor_price, # ⚠️
  data = train_2024
)
# Try to predict 2025
predictions_2025 <- predict(rf_model, newdata = new_listings)
# ↑ avg_neighbor_price is MISSING in new_listings!
The problem isn’t the model TYPE, it’s the LOGIC:
If a feature requires knowing other predictions first, it’s not a valid predictor.
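One way to catch this before deployment is to check that every predictor can actually be filled in for the data you will score. The helper `unavailable_predictors` below is a hypothetical sketch in base R, not part of any package, and the two-row `new_listings` frame is made up.

```r
# Return the predictors in a formula that are absent from, or NA in, newdata
unavailable_predictors <- function(formula, newdata) {
  predictors  <- all.vars(formula[[3]])          # right-hand-side variables
  missing_col <- setdiff(predictors, names(newdata))
  present     <- intersect(predictors, names(newdata))
  has_na <- present[vapply(present,
                           function(v) anyNA(newdata[[v]]),
                           logical(1))]
  c(missing_col, has_na)
}

# Hypothetical new listings: attributes only, no sale prices nearby
new_listings <- data.frame(sqft = c(1500, 1800), bedrooms = c(3, 3))

unavailable_predictors(price ~ sqft + bedrooms + avg_neighbor_price, new_listings)
unavailable_predictors(price ~ sqft + bedrooms, new_listings)
```

The first call flags `avg_neighbor_price`; the second returns an empty vector because every predictor is known at prediction time.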
Final Thought
Always understand what a method is designed for before recommending it.